Initial analysis and clustering of the case-only data
Given the same data, the following commands should yield almost identical results to the results being shown with any discrepancies being stochastic in nature. Some of these commands, however, can take a long time to run, so while we show the commands here as we originally ran them, results are often being read back from file. If you are trying to recapitulate these results using the exact data and code being used here and are running into problems or incongruitous results, please submit an issue, and we will address it as soon as possible.
All data analysis was done on a desktop with 8 cores/16 threads (AMD Ryzen 7 1800x) and 32 GB of DDR4 memory. We begin by setting a “cores” variable for future use.
cores <- 8
Next is a general purpose naming function that will be used to generate any paths seen below:
name <- function(path = "/home/brian/Desktop/flow", folder = "cases", file, type = "rds") {
paste0(path, "/", folder, "/", file, ".", type)
}
These are the packages I will be using.
library(PreciseDist)
library(readr)
library(future)
library(doFuture)
library(heatmaply)
This is the session info.
sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets
[7] methods base
other attached packages:
[1] heatmaply_0.15.2 viridis_0.5.1
[3] viridisLite_0.3.0 plotly_4.7.1
[5] ggplot2_3.0.0 doFuture_0.6.0
[7] iterators_1.0.10 foreach_1.4.4
[9] future_1.9.0 readr_1.1.1
[11] PreciseDist_0.0.0.9000
loaded via a namespace (and not attached):
[1] R.utils_2.6.0 tidyselect_0.2.4
[3] htmlwidgets_1.2 TSP_1.1-6
[5] trimcluster_0.1-2 grid_3.4.4
[7] ranger_0.10.1 Rtsne_0.13
[9] munsell_0.5.0 codetools_0.2-15
[11] SNFtool_2.3.0 miniUI_0.1.1.1
[13] misc3d_0.8-4 withr_2.1.2
[15] colorspace_1.3-2 longitudinalData_2.4.1
[17] Boruta_6.0.0 knitr_1.20
[19] geometry_0.3-6 stats4_3.4.4
[21] robustbase_0.93-1 dtw_1.20-1
[23] dimRed_0.1.0 DistatisR_1.0
[25] listenv_0.7.0 radix_0.5.0.9001
[27] DistributionUtils_0.5-1 Ckmeans.1d.dp_4.2.2
[29] rprojroot_1.3-2 locpol_0.7-0
[31] ipred_0.9-6 randomForest_4.6-14
[33] gclus_1.3.1 diptest_0.75-7
[35] R6_2.2.2 seriation_1.2-3
[37] fields_9.6 rpivotTable_0.3.0
[39] flexmix_2.3-14 manipulateWidget_0.10.0
[41] DRR_0.0.3 bitops_1.0-6
[43] assertthat_0.2.0 promises_1.0.1
[45] networkD3_0.4 SDMTools_1.1-221
[47] scales_1.0.0 nnet_7.3-12
[49] mmtsne_0.1.0 gtable_0.2.0
[51] ddalpha_1.3.4 globals_0.12.1
[53] spam_2.2-0 timeDate_3043.102
[55] rlang_0.2.2 CVST_0.2-2
[57] RcppRoll_0.3.0 profileModel_0.5-9
[59] splines_3.4.4 lazyeval_0.2.1
[61] ModelMetrics_1.1.0 princurve_2.1.0
[63] checkmate_1.8.5 trelliscopejs_0.1.14
[65] broom_0.5.0 heatmap.plus_1.3
[67] rgl_0.99.16 yaml_2.2.0
[69] reshape2_1.4.3 abind_1.4-5
[71] threejs_0.3.1 crosstalk_1.0.0
[73] backports_1.1.3 httpuv_1.4.4.2
[75] caret_6.0-80 tools_3.4.4
[77] lava_1.6.2 infer_0.3.1
[79] gplots_3.0.1 RColorBrewer_1.1-2
[81] proxy_0.4-22 BiocGenerics_0.24.0
[83] analogue_0.17-0 Rcpp_0.12.18
[85] splus2R_1.2-2 plyr_1.8.4
[87] visNetwork_2.0.4 base64enc_0.1-3
[89] progress_1.2.0 purrr_0.2.5
[91] prettyunits_1.0.2 rpart_4.1-13
[93] diffusr_0.1.4 zoo_1.8-3
[95] sfsmisc_1.1-2 cluster_2.0.7-1
[97] magrittr_1.5 data.table_1.11.4
[99] TSclust_1.2.4 mvtnorm_1.0-8
[101] whisker_0.3-2 matrixStats_0.53.1
[103] hms_0.4.2 NetPreProc_1.1
[105] mime_0.6 evaluate_0.11
[107] xtable_1.8-3 mclust_5.4.1
[109] gridExtra_2.3 compiler_3.4.4
[111] tibble_1.4.2 maps_3.3.0
[113] mgc_1.0.1 KernSmooth_2.23-15
[115] crayon_1.3.4 R.oo_1.22.0
[117] htmltools_0.3.6 mgcv_1.8-23
[119] later_0.7.3 tidyr_0.8.1
[121] RcppParallel_4.4.1 lubridate_1.7.4
[123] magic_1.5-8 fpc_2.1-11
[125] autocogs_0.0.1 MASS_7.3-49
[127] Matrix_1.2-14 permute_0.9-4
[129] gdata_2.18.0 wmtsa_2.0-3
[131] R.methodsS3_1.7.1 dotCall64_1.0-0
[133] bindr_0.1.1 gower_0.1.2
[135] igraph_1.2.1 ifultools_2.0-4
[137] pkgconfig_2.0.1 registry_0.5
[139] brglm_0.6.1 ExPosition_2.8.19
[141] philentropy_0.2.0 microbenchmark_1.4-4
[143] recipes_0.1.3 clv_0.3-2.1
[145] webshot_0.5.0 prodlim_2018.04.18
[147] LPStimeSeries_1.0-5 stringr_1.3.1
[149] digest_0.6.18 pls_2.6-0
[151] vegan_2.5-2 graph_1.56.0
[153] rmarkdown_1.10 dendextend_1.8.0
[155] uwot_0.0.0.9004 kernlab_0.9-26
[157] gtools_3.8.1 modeltools_0.2-22
[159] shiny_1.1.0 nlme_3.1-131
[161] glasso_1.10 jsonlite_1.5
[163] bindrcpp_0.2.2 alluvial_0.1-2
[165] TSdist_3.4 pillar_1.3.0
[167] lattice_0.20-35 httr_1.3.1
[169] DEoptimR_1.0-8 survival_2.41-3
[171] glue_1.3.0 xts_0.11-0
[173] prabclus_2.2-6 class_7.3-14
[175] stringi_1.2.3 pdc_1.0.3
[177] KODAMA_1.5 rsample_0.0.2
[179] caTools_1.17.1 dplyr_0.7.6
[181] hglasso_1.2
The first step is to take our flow case data and calculate as many distances, similarities and correlations as possible. Some of these calculations may take seconds while others may take hours, so we generally run this overnight and then kill the function if it is still running in the morning (if the file parameter is set, once a distance finishes it is written to file, so killing the function does not harm already written data).
library(future)
library(doFuture)
options(future.globals.maxSize = +Inf)
registerDoFuture()
plan(multicore, workers = cores)
flow_case_dists <- flow_cases %>%
as.matrix() %>%
precise_dist(
dists = "all_dists",
suffix = "",
file = name(file = "flow_case_dists"),
parallel = TRUE,
local_timeout = Inf,
verbose = TRUE
)
Now we coerce all non-distances (i.e. similarities and correlations) into distances.
flow_case_distances <- flow_case_dists %>%
precise_transform(enforce_dist = TRUE)
Now that we have a large list of distances, we can feed each distance as input into the UMAP algorithm.
options(future.globals.maxSize = +Inf)
registerDoFuture()
plan(multicore, workers = 4)
flow_case_umap <- precise_umap(
data = flow_case_distances,
distance = TRUE,
n_neighbors = 5,
spread = 10,
min_dist = 0.1,
bandwidth = 1,
type = "plotly",
color_vec = NULL,
colors = NULL,
parallel = TRUE,
verbose = TRUE
)
Once we have a list of UMAP results, we will feed it to precise_trellis(), which will automatically take our list of UMAP results and output a visualization for each result. The point of this is to look for UMAP results that indicate the calculated distance is capturing some amount of intrinsic structure within the data.
NOTE: This is an interactive visualization. You can also maximized and minimize it by clicking on the juxtaposed arrows in the bottom right hand corner of the vizualization.
precise_trellis(
data = flow_case_umap,
name = "UMAP of Every Flow Distance",
path = name(file = "trellis_flow_case_umap"),
self_contained = TRUE
)